Speeding Up Dynamic Search Methods in Speech Recognition

Authors

  • Gábor Gosztolya
  • András Kocsor
Abstract

In speech recognition, huge hypothesis spaces are generated. To overcome this problem, dynamic programming can be used. In this paper we examine ways of speeding up this search process even further using heuristic search methods, multi-pass search and aggregation operators. The tests showed that these techniques can be applied together, and that their combination can significantly speed up the recognition process. The resulting search ran 22 times faster than the basic dynamic search method, and 8 times faster than the multi-stack decoding method.

In speech recognition enormous hypothesis spaces arise. To handle them we can use dynamic programming, which avoids calculating the same values several times and thus leads to a dramatic speed-up of a speech recognition system. But this is not enough for real-world applications, so we have to look for other ways of making improvements while preserving recognition accuracy. Here we carry out experiments using search heuristics, aggregation operators and multi-pass search, and apply ideas for speeding up the heuristic search.

1 The Speech Recognition Problem

We have a speech signal given by a series of observations $A = a_1 \ldots a_t$, and a set of phoneme sequences $W$. We look for the word $\hat{w} = \arg\max_{w \in W} P(w|A)$, which, via Bayes' theorem, is equivalent to $\hat{w} = \arg\max_{w \in W} P(A|w) \cdot P(w) / P(A)$. $P(A)$ is the same for all $w$, so $\hat{w} = \arg\max_{w \in W} P(A|w) P(w)$. Let $w$ be $o_1 o_2 \ldots o_n$, where $o_j$ is the $j$th phoneme of $w$. Let $A_1, \ldots, A_n$ be non-overlapping segments of $A$. We assume that the phonemes are independent, i.e. $P(A|w)$ can be obtained from $P(A_1|o_1), \ldots, P(A_n|o_n)$. To calculate $P(A|w)$, we can use aggregation operators at two levels: $g_1$ supplies the $P(A_j|o_j)$ values as $g_1(P(a_{t_{j-1}}|o_j), \ldots, P(a_{t_j}|o_j))$, while $g_2$ is used to construct $P(A|w)$ as $g_2(P(A_1|o_1), \ldots, P(A_n|o_n))$. Instead of a probability $p$ we will use a cost $c = -\ln p$. $g_1$ will be the addition operator.
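The two-level cost aggregation described above can be illustrated with a minimal sketch. The function names are hypothetical, and using addition for g2 is only an assumption made for this example (the text fixes g1 as addition but leaves g2 open); the probabilities are made-up values.

```python
import math


def frame_costs(probs):
    """Turn per-frame probabilities P(a_t | o_j) into costs c = -ln p."""
    return [-math.log(p) for p in probs]


def g1(costs):
    """Segment-level aggregation: addition of the frame costs, as in the text."""
    return sum(costs)


def g2(segment_costs):
    """Word-level aggregation. Addition is an ASSUMPTION for this sketch;
    other aggregation operators could be substituted here."""
    return sum(segment_costs)


def hypothesis_cost(segment_probs):
    """Cost of one hypothesis: one list of frame probabilities per phoneme."""
    return g2([g1(frame_costs(seg)) for seg in segment_probs])
```

With both operators set to addition, the total cost equals the negative log of the product of all frame probabilities, so a lower cost corresponds to a more probable hypothesis.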
A hypothesis is a pair consisting of a phoneme series and a segment series. The dynamic programming method uses a table in which the speech frames $a_i$ index the columns and the phoneme sequences index the rows. A cell holds the lowest cost of the hypotheses that have its phoneme sequence and end at its frame. To compute the value of a cell, we take the value of a cell at an earlier frame whose phoneme sequence is the same sequence without its last phoneme, and add the cost of this last phoneme on the intervening frames. The result is the minimum of these sums.

M. Ali and F. Esposito (Eds.): IEA/AIE 2005, LNAI 3533, pp. 98–100, 2005. © Springer-Verlag Berlin Heidelberg 2005

2 Speeding Up the Recognition Process

The dynamic programming search technique, despite its effectiveness, tends to be quite slow. In this section we discuss some methods that speed it up while keeping the recognition accuracy at an acceptable level.

Heuristic Search Methods. These techniques fill only a part of the table. The result will therefore not always be optimal, but we can get a notable speed-up with little or no loss in accuracy. The multi-stack decoding algorithm fills a fixed number (the stack size) of cells (the ones with the lowest costs) for a row. The Viterbi beam search fills the cell with the best value and the cells close to it, as defined by a beam width parameter. Here we used the multi-stack approach.

Speed-Up Improvements. In earlier work [1] we presented some speed-up ideas for the multi-stack decoding algorithm, which we also want to use here. i) One possibility is to combine multi-stack decoding with a Viterbi beam search. At each column, belonging to one time instance, we fill only a fixed number of cells, and also discard those which are far from the best-scoring value. ii) Another approach is based on the fact that the later the time instance, the fewer hypotheses (and filled cells) are needed.
Thus we filled $s \cdot m^i$ cells belonging to frame $a_i$, where $0 < m < 1$ and $s$ is the original stack size parameter. iii) In fact, we need to fill more cells at those speech frames that are close to the boundaries of the pronounced phonemes. We trained an ANN to estimate whether a given time instance was a phoneme boundary or not. Then we constructed a function that approximates the stack size based on the output of this ANN.

Multi-pass Search. Multi-pass methods work in several steps: in the first pass the poorer hypotheses are discarded using a criterion that requires little computation. To this end, we reduced the number of phoneme groups. In later passes only the remaining hypotheses are examined, but with a more detailed phoneme grouping. The last pass (P0) uses the original phoneme set. To create the phoneme sets, first a distance function on the original phonemes $ph_1, \ldots, ph_m$ is defined: $d(ph_i, ph_j)$ is based on the ratio of $ph_i$ instances classified as $ph_j$ and vice versa. We can use the higher value (d) or the average (d) as the metric. The distance between phoneme groups can be the minimum distance between their phones ($D_{min}$), or the maximum ($D_{max}$) [2]. The recognition steps using the resulting phoneme sets were P1 and P2.
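The table-filling search with a per-frame stack whose size shrinks geometrically (speed-up idea ii) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name, the `frame_cost` callback, rounding the stack size down with a floor of one cell, and using addition to combine segment costs are all choices made for this example, and the toy data are made up.

```python
import heapq
import math


def multistack_decode(frame_cost, n_frames, words, s, m=1.0):
    """Return (best_word, cost), or (None, inf) if every word was pruned.

    frame_cost(phoneme, t): hypothetical callback, cost -ln P(a_t | phoneme).
    words: vocabulary given as tuples of phonemes.
    s, m: stack size and decay factor; frame i keeps about s * m**i cells.
    """
    words = {tuple(w) for w in words}
    # children[prefix] = phonemes that can extend this prefix within the vocabulary
    children = {}
    for w in words:
        for k in range(len(w)):
            children.setdefault(w[:k], set()).add(w[k])

    # stacks[t][prefix] = lowest cost found for `prefix` covering frames 0..t-1
    stacks = [dict() for _ in range(n_frames + 1)]
    stacks[0][()] = 0.0

    for t in range(n_frames + 1):
        # Keep only the cheapest cells of this column (floor of 1 is an assumption).
        limit = max(1, int(s * m ** t))
        kept = heapq.nsmallest(limit, stacks[t].items(), key=lambda kv: kv[1])
        stacks[t] = dict(kept)
        for prefix, cost in kept:
            for ph in sorted(children.get(prefix, ())):
                seg = 0.0  # g1 = addition over the frames of the new segment
                for end in range(t + 1, n_frames + 1):
                    seg += frame_cost(ph, end - 1)
                    new = prefix + (ph,)
                    # Segment costs are combined by addition (an assumption for g2).
                    if cost + seg < stacks[end].get(new, math.inf):
                        stacks[end][new] = cost + seg

    finals = {w: c for w, c in stacks[n_frames].items() if w in words}
    if not finals:
        return None, math.inf
    best = min(finals, key=finals.get)
    return best, finals[best]


# Toy usage: three frames whose "true" phonemes are a, a, b (made-up data).
truth = ["a", "a", "b"]
toy_cost = lambda ph, t: 0.1 if ph == truth[t] else 2.0
best_word, best_cost = multistack_decode(
    toy_cost, 3, [("a", "b"), ("b", "a")], s=4, m=0.5
)
```

Setting `m=1.0` gives a fixed stack size for every frame, while a value between 0 and 1 prunes later frames more aggressively, at the risk of discarding the correct hypothesis when `s` is too small.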




Publication date: 2005